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Abstract 

It has been designed, built and executed a code for the Fast Fourier 
Transform (FFT), compiled and executed in a cluster of 2™ computers 
under the operating system MacOS and using the routines MacMPI. 
As practical application, the code has been used to obtain the trans- 
formed from an astronomic imagen, to execute a filter on its and with 
a transformed inverse to recover the image with the variates given by 
the filter. The computers arrangement are installed in the Observato- 
rio Astronomico National in Colombia under the name OA N Cluster, 
and in this has been executed several applications. 
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1 Introduction 



In the last few years, the implementation of low-cost computer nets for 
processing in parallel has been growing. Most of the designated implementa- 
tions denominate "Clusters" have been developed in work stations under en- 
vironments of Unix; however today exits Clusters that include the processing 
G3-G4 of Motorola-IBM- Apple, in computers Apple Macintosh running un- 
der MacOS,|2| and also with Pentium III under Linux or Windows. Q 

The OA N Cluster is a project that born in December of 1998 motivated 
by the reached developments for Decyk and his team |J, by the hardware 
(Macs/G3) recently incorporate to the Observatorio Astronomico Nacional 
(OAN) and by the developments in software carried out within the investi- 
gation lines: Theoretical Astrophysics, Galactic Astronomy and Numerical 
Methods of the OAN. With the support of Absoft Corporation M was incor- 
porated an excellent software compiler (Fortran 77/90 and C, C ++); that 
together with the routines MacMPI developed by Decyk in ||] completed 
the development structure of the OAN Cluster. 

The first codes were centered in the knowledge of the own commands of the 
MPI, to evolve in the data distribution and operations with matrix. Seve- 
ral applications have been developed, one of them the denominated "prime 
numbers under the rule of Stanislav Ulam" the one which is found available 
in the Web address of the OAN Q and an algorithm in parallel for the Fast 
Fourier Transform. 



2 Fast Fourier Transform 

Let / :l->Ca continuously function, where R represents the set of real 
numbers and C the complex numbers. The Fourier Transform of / is given 



In most of practical situations, the function / is given in discrete form as 
a finite values collection f(xo), f(xi), . . . , /(xn-i) with N G N, where N is 
the set of natural numbers and {xq, xi, . . . , x^-i} is one partition on an real 



by 




(1) 



where i is the imaginary unity and v the frecuency (7j. 
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interval [a, b] and xj, = a+ ^j^k, for < k < N — 1. In problems that imply 
numerical calculation, instead the equation (|l|) we use the sum partial 

1 N ^ 

H W = ]y E ffri)*-™*'", (2) 

3=0 

designated Discrete Fourier Transform (DFT) of / over the interval [a, b\. 
If / is a function defined on the interval [0, 2tt] of real value with period 2tt, 
the values H(k), by k = 0, . . . , N — 1, can be interpreted as the coefficients 

Ck 

N-l 

c k = jr E < k < N - 1 (3) 

3=0 

of a exponential polynomial 

N-l 

p{x) = cxe lkx , (4) 

fc=0 

where x G M, G C, y A; = 0, 1, . . . , N — 1, which interpole / in the values 
f(xo),f(xi),...,f(x N -x). 

The Discrete Fourier Transform of / over the partition {xo,x%, . . . ,xjy-i}, 
is defined as the operator 

DFT :C N — ► 

such that 

[c , ci, ... , C7V -iF = DFT([f(x ),f(xi), . . . , /(x iV -i)] T ) 

The algorithm that evaluates DFT is denominated Fast Fourier Transform, 
and reduces the calculation significantly. For directly calculations is nece- 
sary to use N 2 multiplications, while FFT only needes NLog2N products. 
A comparison for greater values of N is shown in Table |2[ 

The next tree show how is running the interpolation. In the first level (in- 
dicated by the superscript of the polynomials) are related two data in each 
branch that comes of the zero level and produce a polynomial with two co- 
efficients Pj^\ for < k < 3. In the second level, again the points gene- 

(2) 

rated are interpolated and produces new factors and polynomials P^ , with 

(3) 

< k < 1; Finally we obtain the output vector, where P is the polynomial 
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that we requires 



p(3) 




p(2) p(2) 

/\ /\ 

p(l) p(l) p(l) p(l) 

M) ^2 M *3 

/\ /\ /\ /\ 

p(0) p (0) p (0) p (0) p (0) p (0) p (0) p (0) 

The parallel code works 2 l processors, I > 0. For a better comprehension 
let us consider an example: let C = [C(0), . . . , C(7)] eight data and four 
processors. 

The basic variables that are used, are presented in Table |l|. Then, for this 
example we have: nproc = 4, N = 8, m = 3, ml = 1. 



Table 1: Basic variables 



Variable Description 



nproc Number of processors. 

idproc Label correspondent to each processor. 

We have < idproc < nproc, with the processor equal to MASTER. 

N number of data. 

C Vector with the data. 

Z Vector with factors e Nj ,0<j<N — 1 . 

m Number of tree's levels calculated for 

n Number of level in the process, < n < m. 

ml Number of level until work all processors, m 

nprocsup Number of processors in some levels. 



Log(nproc) 
Log{2) 



Each processor works with its respective branch: 
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idproc = idproc = 2 idproc = 1 idproc = 3 

P (i) p (i) p (i) p (i) 



/\ /\ /\ 

p(0) p (0) p (0) p (0) p (0) p (0) p (0) p (0) 

(7(0) C(4) C(2) (7(6) (7(1) (7(5) (7(3) (7(7) 

In this step, the processor identified with the label idproc generates the 
coefficients' vector: Pjjlroc f° r < idproc < nproc = 4; (n = 1). The 
processing of the information continues until n = ml = 1. In this point 
idproc = 2 transmits its vector to idproc = 0, and idproc = 3 transfers 
its vector to idproc = 1. Before of the transference, each processor stores 
the information in a matrix with two columns where the real part is stored 
in the first column and the imaginary part in the second column. This is 
necessary because the commando to send don't recognize complex numbers 
in language C. 

In this moment idproc = 2 and idproc = 3 don't work. Moreover, nprocsup = 
np ™ c = 2, so, in the level n = 2 the number of processors is reduced to the 
half. Prom now on, where we avanced of level, the middle of processors out 
of the game and the variable nprocsup descend to the middle. 

The processors that receive the information unpack in C. 

idproc = idproc = 1 

p(2) p (2) 

(7(0) (7(2) (7(4) (7(6) 

(7(1) (7(3) (7(5) (7(7) 

/\ /\ 
P (i) P (i) P (i) P (i) 

In the level n = 2, idproc bring about the vector of coefficients -P^ roc for 
< idproc < nprocsup = 2. That which is stated previously, is repeated 
while nprocsup > 1, that is to say, while exist one processor. The MASTER 
receive the communication of idproc = 1 and effects the last calculation to 
bring out P^ roc [idproc = and n = m = 3), in this instant nprocsup = \ 
and the process finish. 
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3 THE INVERSE TRANSFORM 

We affirm that the exponential polinomyal (||) estimates the Inverse Discrete 
Fourier Transform (IDFT), definied as the operator 

IDFT :C N — ► 

such that 

[f(x ), /On), . . . , fixN-^f = idft([c , Ci , . . . , c^i] r ) 



We use the FFT to calculate the IFFT[|. We define x k = ^ for 

l7rfc 
N 



< k < N — 1, and evaluate p in 2tt - x k = 2ir 2 " A ' 



N-l N-l N-l . , N-l 



3=0 j=0 j=0 i=0 

(5) 

In this way obtain the IFFT estimating the FFT multiplied for N and 

. . , 2irk . ( N -k 
p(2ir - x k ) = p{2ir — ) = p[ 2tt 



Cj e 



N V N 
so the inverse transform is in contrary order. 



4 COMPLEXITY COMPUTATIONAL OF FFT 
IN PARALLEL 

If the number of data is N = 2 m (m > 1) and the number of processors is 
2 l (I > 0), the order of complex multiplications effected until the level ml, 
correspond to the multiplications made for each processor with its respective 
branch. The number of multiplications carried out until the level n is n2 m 
for < n < m. Particulary, if n = ml we have 0{m\{2 m )). Then dividing 
for the number of processors, the complexity descend to 

mli ?p = (ml)2 m - 1 = ml(2 ml ) 
1 Inverse Fast Fourier Transform 
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for n = ml. 



Finally we adds the number of multiplications effected in the levels n, for 
ml < n < m. In each level we have 2 m products. Considering that for each 
level n, ml < n < m, the number of processors is reduced to the middle, 
obtain 

l 

^ ^ 2 m i+* 
i=i 

from ml + 1 until n = m. 
Summarizing, 

l 

O ((m - l)2 m - 1 + 2 m ^ +i ) . (6) 

i=l 

Calculating the ratio between sequential FFT and parallel FFT, we have 
NLog 2 N m2 m 



(m - l)2 m ~ l + Y! i=l 2 m ~ l +i (m - l)2 m ~ l + ^- =1 2 m ~ l +i 

m 



(m - l)2-l + Yj i=1 2-1+* 

This indicate that the algorithm in parallel is — r^^n t— faster than 

the sequential algorithm. 

Table 2: Comparative arithmetic products 



TV 


N 2 


NLog 2 N 


Pfp 


RSP FFT 


512 


262144 


4608 


1664 


2.77 


2048 


4194304 


22528 


7680 


2.93 


8192 


67108864 


106496 


34816 


3.06 


32768 


1073741824 


491520 


155648 


3.16 



Pfp: Paralell with four processors. 

RSP FFT: Ratio between sequential and paralell FFT. 
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5 THE TWO-DIMENSIONAL DISCRETE TRANS- 
FORM 2D-DFT 

For two variables the sample is in xy-plane where the sample is uniformly 
distribuited in the parallel straight-lines to the x-axis (rows) and the parallel 
straight-lines y-axis (columns). We define M-nxM^) the matrix of size 
N x M which contain complex numbers. Let f(x,y) : M 2 — > C be a two- 
dimensional function, its 2D-DFT is definied as the operator 

2D - DFT : M NxM (C) — » M NxM (C) 

(f(x,y)) i-> (f(u,v)) 

V y ,y; /NxM V v ' 'J NxM 



where 

N-l M-l 



N 

x=0 y=0 



for u = 0, 1, . . . , N - 1, v = 0, 1, . . . , M - 1. 

And we establish the 2D-IDFT (Inverse Discrete Fourier Transform two- 
dimensional) as 

2D - IDFT : M NxM (C) — ► M NxM (C) 



NxM 



F ( u ' v )) „ „ ^ ( /(*»!/)) 

J NxM V /j 
iV-lM-1 

/(^)4lE^ B ) e * 2l(f+s> ( § ) 



where 

N-l M-l 



u=0 v=0 

for x = 0, 1, . . . , N - I, y = 0, 1, . . . , M - 1. 



The equation (0) we express as 

N-lM-l N-l 

F ^ v ) = Tj E E /(*' y)e- l2 ^e- 2 ^ = 1 ]T tOe"*** (9) 



x=0 y=0 x=0 

where F(x, v) = M^Jg Y^=o /( x > y)e~ l2n ^ 1 • In this way, we may frequent 
use of the DFT on rows and on columns. 

The implementation of the algorithm for the 2D-FFT (Fast Fourier Trans- 
form two-dimensional) is, 

FFT on rows — ► Multiplication for M — > FFT on columns, pfl 



S 



6 APPLICATION 



A application has been diriged to the image processing. We can represent a 
image as a function 

f :R 2 — ► Z 

(x,y) ^ z = f(x,y) 

where Z is the set of integer numbers, (x, y) is the coordinate image's point 
with brightness f(x,y). A digital image, it's a image which x,y,f(x,y) £ 
Z + U {0}. Thus, a image is a two-dimensional array (P x ,y)NxM of pixels 
(picture elements). 

The algorithm pads with zeros on rows and on columns to complete the next 
power of two. 

The Fourier Transform we can represent trought of its spectrum 

y/Re[C(x, y)] 2 + Im[C(x, y)] 2 (10) 

where Re[C(x,y)] is the real part and Im[C(x,y)] is the imaginary part of 
the Transform's element (x,y). 

A images' filter is a operator H : Z 2 — ► R which permits to change the 
brightness in the digital image. For the Convolution Theorem we can use 
a filter and to multiply it with the real and imaginary part image's Fourier 
Transform. A filter can be designed either to eliminate or to create noise in 
a image. We use the filter: 



F(x,y) = 




where (cxjCq) is the image's center and b > 0, |J. 

Finally, we calculate the 2D-IFFT to this product and obtain the image 
filtered. In the next step, we can see the smoothing effect derived using the 
filter over the spiral galaxy NGC5194 (Whirpool Galaxy). This picture has 
a size of 460x506 pixels, and was taken of |lc|| ; in our work we use a filter 
parameter of b = 340. 
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Figure 1: Original galactic picture 




Figure 3: Transformed picture 

7 SUMMARY AND CONCLUSIONS 

1. Arithmetical expression for number of multiplications in the parallel 
FFT was obtained in the equation (^j and calculated the ratio between 
sequential and paralell FFT products (with four processors) for four 
values of N, listed in Table |2[ 

2. We write until the proper functioning the algorithm 2D-FFTp.c. using 
the MPI's routines and tools of MacOS system. 

3. We find that the efficiency increase when the communications has the 
minimum rate of transference executing the code in the cluster. 

4. We build the filter of the equation and had been applied to the 
galaxy NGC5194 obtaining a new image (figure ||) which the high 
frequency's components had been eliminated. 

We acknowledge an anonymous referee for this helpful coments. We thank 
Viktor K. Decyk in UCLA for his useful recommendatios and permanent 
assistance. This work has been supported by DIB of the Universidad Na- 
tional de Colombia through Proyect DIB-803577. 
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